CMPT 825: Natural Language Processing, Spring 2008
Abstract
Hidden Markov Models require a large set of parameters, which are induced from a text corpus. The parameters should be optimal in the sense that the resulting models assign high probability to the seen training data. There are several methods for estimating model parameters. The first uses each word as a state and estimates the probabilities from relative frequencies. The second is a variation of the first: words are automatically grouped by similarity of distribution in the corpus, and each group is represented by a state in the model. This drastically reduces the number of model parameters and thereby mitigates the sparse data problem. The third method uses manually defined categories. An important difference from the second method, with its automatically derived categories, is that with manual definition a word can belong to more than one category. The fourth method is a variation of the third and is also used for part-of-speech tagging. It does not need a pre-annotated corpus for parameter estimation; the parameters are estimated using the Baum-Welch algorithm. This paper proposes a fifth method for estimating natural language models that combines the advantages of the methods mentioned above.
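The first method in the abstract (one state per word, with probabilities taken from relative frequencies) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `estimate_transitions` and the toy corpus are hypothetical, and the sketch covers only transition probabilities between word states, with no smoothing.

```python
from collections import Counter, defaultdict


def estimate_transitions(tokens):
    """Relative-frequency estimate of P(next word | current word).

    Each distinct word is treated as one state (an assumption matching
    the first method described in the abstract).
    """
    # Count adjacent word pairs (current, next).
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count how often each word occurs as the left element of a pair.
    context_counts = Counter(tokens[:-1])

    probs = defaultdict(dict)
    for (cur, nxt), count in pair_counts.items():
        probs[cur][nxt] = count / context_counts[cur]
    return dict(probs)


# Hypothetical toy corpus, used only for illustration.
corpus = "the dog barks the dog sleeps the cat sleeps".split()
transitions = estimate_transitions(corpus)

for state, row in sorted(transitions.items()):
    print(state, row)
```

Every word pair that never occurs in the corpus gets probability zero here, and the number of parameters grows with the square of the vocabulary size. That is the sparse data problem that the grouping-based methods are meant to reduce.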
Similar resources
Cmpt 825: Natural Language Processing
1.1 Markov Processes Consider a set of states S1, S2, ..., SN. A discrete Markov process is one in which the system is in a particular state at any given time. The state can change only at discrete intervals of time. We denote the time instants associated with state changes as t = 1, 2, ... and the actual state at time t as qt. The current state in an n-order Markov proce...
Cmpt 825: Natural Language Processing 12.0.1 Definitions
Tagger: Program that tags a word (w_i) in a text with its part of speech (POS) tag t_i. Precision: The number of correct responses out of the total number of responses. Recall: The number of correct responses out of the number correct in the key. Word Features: Features of a word that are used in characterizing the type of entity it is (i.e. whether a word is capitalized, whe...
Cmpt 825: Natural Language Processing 1.1 Hiding a Semantic Hierarchy in a Markov Model [1] 1.1.1 General Concepts
We know that in logic a predicate is a relation between its arguments. In other words, a predicate defines constraints between its arguments. A predicate ρ(v, r, c) is called a selectional restriction, where v is a verb, r is a role or an object, and c is a class, which is a noun. A selectional preference σ : (v, r, c) → a is a function from these predicates to a real number. Where a shows the de...
On Distributed Concurrent Multi-Port Router Test System
This paper presents a framework of the distributed concurrent multi-port-testing test system (CMPT-TS) for IP routers under development at Sichuan Network Communication Key Laboratory. Having analyzed the actuality of concurrent testing for routers, this paper develops a distributed architecture of CMPT-TS and discusses its functional components in detail. Moreover, a new test definition langua...
Machine Reading
Over the last two decades or so, Natural Language Processing (NLP) has developed powerful methods for low-level syntactic and semantic text processing tasks such as parsing, semantic role labeling, and text categorization. Over the same period, the fields of machine learning and probabilistic reasoning have yielded important breakthroughs as well. It is now time to investigate how to leverage t...
Publication date: 2006